An Introduction to Bayesian Regression and Hierarchical Models

Dublin Data Science

Mick Cooney

2019-08-21

Introduction

Diamonds


carat  cut        color  clarity  depth  table  price     x     y     z
 0.23  Ideal      E      SI2       61.5     55    326  3.95  3.98  2.43
 0.21  Premium    E      SI1       59.8     61    326  3.89  3.84  2.31
 0.23  Good       E      VS1       56.9     65    327  4.05  4.07  2.31
 0.29  Premium    I      VS2       62.4     58    334  4.20  4.23  2.63
 0.31  Good       J      SI2       63.3     58    335  4.34  4.35  2.75
 0.24  Very Good  J      VVS2      62.8     57    336  3.94  3.96  2.48
 0.24  Very Good  I      VVS1      62.3     57    336  3.95  3.98  2.47
 0.26  Very Good  H      SI1       61.9     55    337  4.07  4.11  2.53
 0.22  Fair       E      VS2       65.1     61    337  3.87  3.78  2.49
 0.23  Very Good  H      VS1       59.4     61    338  4.00  4.05  2.39

US Radon Measurements


## # A tibble: 12,777 x 26
##    idnum state state2 stfips zip   region typebldg floor  room basement windoor
##    <chr> <chr> <chr>  <chr>  <chr> <chr>  <chr>    <int> <int> <chr>    <chr>  
##  1 1     AZ    AZ     4      85920 1      1            1     2 N        <NA>   
##  2 2     AZ    AZ     4      85920 1      0            9     0 <NA>     <NA>   
##  3 3     AZ    AZ     4      85924 1      1            1     3 N        <NA>   
##  4 4     AZ    AZ     4      85925 1      1            1     3 N        <NA>   
##  5 5     AZ    AZ     4      85932 1      1            1     1 N        <NA>   
##  6 6     AZ    AZ     4      85936 1      1            1     1 N        <NA>   
##  7 7     AZ    AZ     4      85936 1      0            9     0 <NA>     <NA>   
##  8 8     AZ    AZ     4      85936 1      1            1     1 N        <NA>   
##  9 9     AZ    AZ     4      85938 1      0            9     0 <NA>     <NA>   
## 10 10    AZ    AZ     4      85938 1      1            1     3 <NA>     <NA>   
## # … with 12,767 more rows, and 15 more variables: rep <chr>, stratum <chr>,
## #   wave <dbl>, starttm <chr>, stoptm <chr>, startdt <chr>, stopdt <chr>,
## #   activity <dbl>, pcterr <dbl>, adjwt <dbl>, dupflag <chr>, zipflag <chr>,
## #   cntyfips <chr>, county <chr>, log_radon <dbl>

Basic Regression Theory

Formulating the Problem


Observations drawn from a Normal distribution


\[ y \sim \mathcal{N}(\mu, \sigma) \]

Predictors determine mean, \(\mu\)


Choose functional form for this relationship

Linear Model


\[ f(x) \to \mathbf{X} \beta \to \mu \]

Consequence


Each point drawn from an individual distribution


\[ y_i \sim \mathcal{N}\left(\sum \beta_j x_{ji}, \, \sigma\right) \]

Model fit \(\rightarrow\) determine values for \(\beta\)

Model Fitting


How do we determine \(\beta\)?

Calculate \(\mu\) from data


Calculate log-likelihood of measurement, \(\mathcal{L}(y \, | \, \mu, \sigma)\)


Sum over all datapoints
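
The three steps above can be sketched in a few lines of Python (the data and coefficients here are simulated for illustration, not from the diamonds or radon sets):

```python
import numpy as np

def gaussian_log_lik(y, X, beta, sigma):
    """Total log-likelihood of y under y_i ~ N(X_i . beta, sigma)."""
    mu = X @ beta  # linear predictor gives each observation's mean
    # log-density of each datapoint under N(mu_i, sigma), summed
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (y - mu)**2 / (2 * sigma**2))

# simulate a small dataset: intercept plus one predictor
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

# the generating coefficients score higher than a perturbed guess
ll_true = gaussian_log_lik(y, X, beta_true, 0.5)
ll_off  = gaussian_log_lik(y, X, beta_true + 1.0, 0.5)
```

Maximising this sum over \(\beta\) recovers the classical least-squares fit; the Bayesian treatment below instead asks for the whole distribution of plausible \(\beta\).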

Parameter uncertainty

Bayesian Regression

Bayesian Inference Engine


Prior Knowledge

\(+\)

Data


\(=\)


Posterior Knowledge

Parameters, \(\theta\)


Data, \(D\)

Prior: \(p(\theta)\)


Likelihood: \(p(D | \theta)\)


Posterior: \(p(\theta | D)\)

\[ p(\theta \, | \, D) = \frac{p(\theta) \, p(D \, | \, \theta)}{\int p(\theta) \, p(D \, | \, \theta) \, d\theta} \]


Exact posterior calculation requires a high-dimensional integral (the normalising constant)

Use MCMC to sample posterior
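
Before reaching for MCMC, the prior \(\times\) likelihood \(\to\) posterior mechanics can be seen directly with a grid approximation. A toy Python sketch for the mean of Normal data, assuming the observation scale is known (\(\sigma = 1\)) and using a \(\mathcal{N}(0, 1)\) prior; the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)  # simulated observations

theta = np.linspace(-5, 5, 2001)                # grid over the parameter
dtheta = theta[1] - theta[0]

log_prior = -0.5 * theta**2                     # N(0, 1) prior, up to a constant
# log-likelihood: sum over datapoints at each grid value of theta
log_lik = -0.5 * ((data[None, :] - theta[:, None])**2).sum(axis=1)

log_post = log_prior + log_lik                  # prior x likelihood, on log scale
post = np.exp(log_post - log_post.max())        # stabilise before exponentiating
post /= post.sum() * dtheta                     # normalise numerically

post_mean = (theta * post).sum() * dtheta
```

Grid approximation only works in one or two dimensions; a regression with many \(\beta\) coefficients is exactly the high-dimensional setting where MCMC takes over.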

Stan, rstan, and rstanarm / brms


Probabilistic Programming Language


CmdStan, PyStan, rstan


rstanarm and brms

## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
## 
## Energy:
## E-BFMI indicated no pathological behavior.

Using Model Outputs


How do we use these posterior draws?

Negative prices?


Use logs

## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
## 
## Energy:
## E-BFMI indicated no pathological behavior.

A word of caution…

Predicting log(price)

What about price?
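
To get back to the price scale, each posterior-predictive draw of \(\log(\text{price})\) is simply exponentiated. A minimal numpy sketch, using made-up values for the predictive location and scale rather than a real fit:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 8.0, 0.5                      # hypothetical predictive params (log scale)
log_draws = rng.normal(mu, sigma, size=100_000)

price_draws = np.exp(log_draws)           # back-transform: always positive
```

Note the caution this implies: on the price scale the draws are lognormal and right-skewed, so the mean sits above the median, and `exp()` of a posterior mean on the log scale estimates the median price, not the mean price.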

Induce a uniform distribution?

Hierarchical Models

What about small data?


How do we use structure?

Fully-Pooled Model

## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
## 
## Energy:
## E-BFMI indicated no pathological behavior.

Unpooled Model

## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
## 
## Energy:
## E-BFMI indicated no pathological behavior.

Wider uncertainty intervals \(\to\) smaller sample sizes

What about partial pooling?

\[\begin{eqnarray*} \text{Larger samples} &\to& \text{individual estimates} \\ \text{Smaller samples} &\to& \text{grouped estimates} \end{eqnarray*}\]
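
In the simplest setting (group means with known variances) this compromise has a closed form: a precision-weighted average of the group's own mean and the grand mean. A sketch with illustrative numbers:

```python
def partial_pool(ybar_j, n_j, grand_mean, sigma_y, sigma_a):
    """Precision-weighted compromise between a group's own mean and the
    grand mean: large n_j pulls toward ybar_j, small n_j toward grand_mean."""
    precision_j = n_j / sigma_y**2        # information in the group's data
    precision_a = 1 / sigma_a**2          # information in the group-level prior
    w = precision_j / (precision_j + precision_a)
    return w * ybar_j + (1 - w) * grand_mean

# same observed group mean, very different sample sizes (made-up values)
big   = partial_pool(ybar_j=1.5, n_j=500, grand_mean=1.0, sigma_y=0.8, sigma_a=0.3)
small = partial_pool(ybar_j=1.5, n_j=2,   grand_mean=1.0, sigma_y=0.8, sigma_a=0.3)
```

The large-sample group keeps an estimate near its own mean of 1.5, while the two-observation group is shrunk most of the way toward the grand mean. In the full hierarchical model the variances are estimated jointly rather than fixed, but the shrinkage behaviour is the same.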

Partial Pooling

## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
## 
## Energy:
## E-BFMI indicated no pathological behavior.

Partial Pooling with Predictors

## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 15.
## 
## Energy:
## E-BFMI indicated no pathological behavior.

Stan Programs

Domain Specific Language


Compiles to C++


Documentation improving rapidly

## data {
##   int<lower=1> N;
##   int<lower=1> J; // number of counties
##   int<lower=1,upper=J> county[N];
##   vector[N] x;
##   vector[N] y;
## }
## parameters {
##   vector[J] a;
##   real beta;
##   real<lower=0> sigma_a;
##   real<lower=0> sigma_y;
##   real mu_a;
## }
## model {
##   vector[N] y_hat;
##   for (i in 1:N)
##     y_hat[i] = beta * x[i] + a[county[i]];
## 
##   beta ~ normal(0, 1);
##   mu_a ~ normal(0, 1);
##   sigma_a ~ cauchy(0, 2.5);
##   sigma_y ~ cauchy(0, 2.5);
## 
##   a ~ normal(mu_a, sigma_a);
##   y ~ normal(y_hat, sigma_y);
## }

Flexible


Censored, truncated data


Generative modelling

Conclusion

Problems and Shortcomings


Not a magic bullet


Implementation more complex (by design)

Future Improvements


Ecosystem developing rapidly


PyStan and arviz

Further Resources


Stan documentation


Statistical Rethinking


rstanarm, tidybayes, bayesplot, shinystan

Questions?


Email:


GitHub: https://github.com/kaybenleroll/data_workshops